Twitter Study on 2016 Election

Introduction

Social media is now an essential part of modern life, and people have become used to sharing their thoughts and opinions on platforms such as Twitter and LinkedIn. In this exploratory, preliminary study, we use approximately 1 million tweets collected between Oct. 19th, 7:30pm and Oct. 20th, 1:30am to try to predict the outcome of the presidential election. Note that the debate started at ~9:00pm on Oct. 19th. In addition to providing more insight into how social media reacts to a presidential debate, this study also serves as an exploration of the performance of the Word2Vec module in Apache Spark's spark.ml package.

Method

Computing Infrastructure

The calculations were carried out on a homogeneous (Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz) stand-alone Spark cluster consisting of one master node (2 cpu, 4GB RAM) and two worker nodes (8 cpu, 6GB RAM), all running Scientific Linux 6 as the operating system. Since this minimal cluster was built temporarily for this study from the desktop computers in my office, no external cluster manager, such as YARN or Mesos, was used. The actual Spark application is this Jupyter Notebook, which can be considered a PySpark shell.

In this study, tweets gathered directly via the Twitter Streaming API are in JSON format, and each tweet may have dramatically different fields. Hence a flexible schema is important when choosing the database for warehousing and serving the collected tweets. MongoDB is a NoSQL database that provides this flexibility. Although it can also scale out via sharding and provide redundancy via replica sets, a single MongoDB instance is sufficient for this study, considering the data size and the streaming bandwidth limit imposed by Twitter.

Text Mining

All the tweets are separated into two sets: those posted before the 3rd presidential debate and those posted during or after it. On the technology side, Python is used as the main programming language, and standard Python tools such as NumPy and Pandas are used for data munging. Twitter users tend to use abbreviated words and phrases, which makes text cleaning a challenge, so NLTK was used for word stemming and stop-word removal. The tweet text was analyzed using the Word2Vec NLP algorithm (skip-gram model) and the KMeans clustering algorithm, both as implemented in Apache Spark's spark.ml package. For querying the results and performing window aggregation, both Spark's DataFrame API and standard SQL were used. Finally, the data visualization was carried out using plotly and ggplot.

In this study, we selected only the original tweets (excluding retweets) as the sample, assuming that the distribution of political opinions within original tweets correctly reflects that among all voters. We also kept only the tweets in English to simplify the NLP process. The word vector length was set to 100, which is a typical value for short documents. The number of clusters was set to 5 for this preliminary study.

Result and Discussion

As shown below, the KMeans calculation converged on the 100 "features" generated by the Word2Vec process. By comparing individual word vectors and actual tweet text with the centroid vectors, we describe the 5 categories for the pre-debate and in/post-debate periods as follows:

|           |Pre-Debate                 |In/Post-Debate                                  |
|Category 0 |Neutral: Ads/Reports       |Neutral: Expressing frustration on this election|
|Category 1 |Neutral: Unbiased comments |Slightly Supporting Clinton                     |
|Category 2 |Slightly Supporting Trump  |Against Trump                                   |
|Category 3 |Neutral: Reports           |Neutral: Reports                                |
|Category 4 |Slightly Supporting Clinton|Slightly Supporting Trump                       |
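
The notebook does not show how word vectors were compared with the centroids, so the following is only a minimal sketch of one plausible approach: ranking vocabulary words by cosine similarity to a centroid. The helper names and the 3-dimensional toy vectors are invented for illustration; in Spark the real inputs would presumably come from `Word2VecModel.getVectors()` and `KMeansModel.clusterCenters()`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_words(centroid, word_vectors, top_n=3):
    """Rank vocabulary words by similarity to one cluster centroid."""
    scored = [(word, cosine_similarity(centroid, vec))
              for word, vec in word_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors and centroid, invented for illustration only.
toy_word_vectors = {
    "vote": [0.9, 0.1, 0.0],
    "win": [0.8, 0.3, 0.1],
    "report": [0.0, 0.9, 0.2],
    "news": [0.1, 0.8, 0.3],
}
toy_centroid = [1.0, 0.2, 0.0]

top_words = closest_words(toy_centroid, toy_word_vectors, top_n=2)
```

Words that rank highest for a centroid give a rough hint of what the cluster is about, but reading sample tweets from each cluster (as done later in this notebook) remains necessary to label it.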

According to the visualization, more tweets supported Hillary Clinton than Donald Trump, so this study predicts that Hillary Clinton will win the presidential election. Interestingly, the visualization also shows that as the start of the 3rd presidential debate approached, more and more people became "neutral" or "undecided". After the debate started, however, more people "made a choice": both Trump and Clinton gained supporters, but Clinton still had more.

Although our NLP model (Word2Vec + KMeans) predicts Hillary Clinton as the winner of this election, the aggregation study of the most popular hashtags showed a different outcome. As shown below, the number of tweets with hashtags supporting Trump is larger than the number of tweets with hashtags supporting Clinton. This difference indicates that our NLP model may not be accurate enough and needs more hyperparameter tuning (e.g. word vector length, number of KMeans clusters) and more careful text cleaning. Interestingly, the current model can roughly differentiate neutral opinions from slightly biased ones, but more categories may need to be added, considering that people's political views are quite complex.
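
As a rough illustration of the hashtag aggregation mentioned above, the sketch below counts how many tweets carry each hashtag. It is an assumption about how such a count could be done, not the notebook's actual code; `count_hashtags` and the toy hashtag lists are hypothetical.

```python
from collections import Counter


def count_hashtags(hashtag_lists):
    """Count the number of tweets that contain each hashtag."""
    counts = Counter()
    for tags in hashtag_lists:
        # Lower-case and de-duplicate so a tag counts once per tweet.
        counts.update(set(tag.lower() for tag in tags))
    return counts


# One list of hashtags per tweet; toy data for illustration only.
toy_hashtags = [
    ["debatenight", "TrumpTrain"],
    ["ImWithHer", "debatenight"],
    ["trumptrain"],
    [],
]

hashtag_counts = count_hashtags(toy_hashtags)
top_two = hashtag_counts.most_common(2)
```

Deciding which hashtags count as support for a candidate still requires a manual mapping, which this sketch does not attempt.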


Code and Procedures

Create A Connection To MongoDB

In [1]:
import pymongo
from pymongo import MongoClient
client = MongoClient('mongodb://152.3.169.27:27017/')
In [2]:
import numpy as np, scipy as sp
import pandas as pd
from pandas import DataFrame as df, Series as ss
import sklearn as sk
import ggplot, seaborn as sns
from pyspark.sql.types import *
from pyspark.sql.functions import *
%matplotlib inline
import random, string
In [3]:
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps = PorterStemmer()
In [4]:
tweets = client['db_third_debate']
pre_debate = tweets['pre_debate']
in_debate = tweets['in_debate']
In [5]:
pre_debate.find({"retweeted_status":{"$exists": False}}).count()
Out[5]:
114544
In [6]:
in_debate.find({"retweeted_status":{"$exists": False}}).count()
Out[6]:
303731

We Take Only The Original Tweets

In [7]:
pre_debate_tweets = pd.DataFrame(list(pre_debate.find({"retweeted_status":{"$exists": False}}, 
         projection={"_id": False, "text": True, "lang": True, "timestamp_ms": True, "user['location']": True, \
                     "retweet_count": True, "entities.hashtags.text": True})) )
pre_debate_tweets = pre_debate_tweets.dropna(how='any', subset=['text'])
In [8]:
in_debate_tweets = pd.DataFrame(list(in_debate.find({"retweeted_status":{"$exists": False}}, 
         projection={"_id": False, "text": True, "lang": True, "timestamp_ms": True, "user['location']": True, \
                     "retweet_count": True, "entities.hashtags.text": True})) )
in_debate_tweets = in_debate_tweets.dropna(how='any', subset=['text'])
In [9]:
# Use an explicit 64-bit integer type: the np.int alias was removed in newer NumPy releases,
# and millisecond timestamps do not fit in 32 bits.
pre_debate_tweets['timestamp_ms'] = pre_debate_tweets['timestamp_ms'].astype(np.int64)
pre_debate_tweets['retweet_count'] = pre_debate_tweets['retweet_count'].astype(np.int64)
pre_debate_tweets['tweet'] = pre_debate_tweets['text']
in_debate_tweets['timestamp_ms'] = in_debate_tweets['timestamp_ms'].astype(np.int64)
in_debate_tweets['retweet_count'] = in_debate_tweets['retweet_count'].astype(np.int64)
in_debate_tweets['tweet'] = in_debate_tweets['text']
In [10]:
pre_sample = random.sample(range(0, pre_debate_tweets.shape[0]), 25000)
in_sample = random.sample(range(0, in_debate_tweets.shape[0]), 25000)
In [11]:
# pre_debate_tweets_sample = pre_debate_tweets.iloc[pre_sample].copy()
pre_debate_tweets_sample = pre_debate_tweets[pre_debate_tweets['lang']=="en"].copy()
In [12]:
pre_debate_tweets_sample.shape
Out[12]:
(83095, 6)
In [13]:
# in_debate_tweets_sample = in_debate_tweets.iloc[in_sample].copy()
in_debate_tweets_sample = in_debate_tweets[in_debate_tweets['lang']=="en"].copy()
In [14]:
in_debate_tweets_sample.shape
Out[14]:
(209463, 6)

Cleaning The Text

In [15]:
class cleanTweets(object):
    
    acc = 0
    stops = set(stopwords.words("english"))  
    
    @classmethod
    def clearText(self, textString):
        '''
        @param textString: String
        '''
        # Back to plain string
        # s1 = str(textString.decode('unicode_escape').encode('ascii','ignore'))
        
        # Guard against non-string values such as NaN (a float)
        if isinstance(textString, float):
            textString = str(textString)
        s1 = str(textString.encode('ascii', 'ignore'))
        # Remove URLs
        s1 = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', s1)
        # Remove @...s
        s1 = re.sub(r'@\w+\b', ' ', s1)
        # Remove hashtags, we will check hash tags later from tweet fields
        s1 = re.sub(r'#\w+\b', ' ', s1)
        # Punctuations, special treatment for "U.S."
        s1 = re.sub(r'U\.S\.', 'US', s1)
        # s1 = re.sub(r'[\"\'\`\,\.\-\:\{\}\!\?\<\>\[\]]|\&amp\;|\\n', ' ', s1)
        s1 = s1.translate(None, string.punctuation)

        # Apply NLTK
        # s1 = ' '.join(nltk.word_tokenize(s1)).strip()
        s1 = nltk.word_tokenize(s1)
        words = [ps.stem(w.lower()) for w in s1 if not w.lower() in self.stops ]
        
        # lowerCaseWords = [w.lower() for w in words]
        self.acc = self.acc + 1
        # print self.acc

        # return ' '.join(lowerCaseWords)
        # return lowerCaseWords
        return words
    
    @classmethod
    def collectHashTags(self, thisEntities):
        j = []
        for tag in thisEntities['hashtags']:
            j.append(str(tag[u'text'].encode('ascii', 'ignore')))
        self.acc = self.acc + 1
        return j
In [16]:
cleanTweets.acc = 0
In [17]:
pre_debate_tweets_sample['text'] = pre_debate_tweets_sample['text'].apply(cleanTweets.clearText)
In [18]:
in_debate_tweets_sample['text'] = in_debate_tweets_sample['text'].apply(cleanTweets.clearText)
In [19]:
pre_debate_tweets_sample['entities'] = pre_debate_tweets_sample['entities'].apply(cleanTweets.collectHashTags)
in_debate_tweets_sample['entities'] = in_debate_tweets_sample['entities'].apply(cleanTweets.collectHashTags)
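
The class above relies on Python 2 string behaviour (for example `s1.translate(None, string.punctuation)`). The following is a minimal Python 3 sketch of the same cleaning steps, using only the standard library. It is not a drop-in replacement: the stop-word set is a tiny hypothetical stand-in for NLTK's English list, tokens are split on whitespace instead of `nltk.word_tokenize`, stemming is omitted, and the URL pattern is simplified. The sample tweet is made up.

```python
import re
import string

# Tiny illustrative stop-word set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "a", "an", "is", "on", "to", "of", "and", "in"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text, stopwords=TOY_STOPWORDS):
    """Return lower-cased tokens with URLs, mentions, hashtags,
    punctuation and stop words removed (no stemming)."""
    # Drop non-ASCII characters, as the original class does.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    # Keep "U.S." as one token before punctuation is stripped.
    text = text.replace("U.S.", "US")
    text = text.translate(PUNCT_TABLE)
    tokens = [word.lower() for word in text.split()]
    return [word for word in tokens if word not in stopwords]


# Made-up tweet for illustration.
cleaned = clean_tweet(
    "Turn on the #debate @FoxNews https://t.co/abc U.S. election!")
```

Stripping URLs, mentions and hashtags before punctuation matters: once `:`, `/`, `@` and `#` are removed, those tokens can no longer be recognised.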

Start SparkSession

In [20]:
# spark.conf.set('spark.executor.memory', '6G')
In [21]:
sp_pre_debate_tweets = spark.createDataFrame(pre_debate_tweets_sample)
sp_pre_debate_tweets.cache()
sp_in_debate_tweets = spark.createDataFrame(in_debate_tweets_sample)
sp_in_debate_tweets.cache()

sp_pre_debate_tweets = sp_pre_debate_tweets.withColumnRenamed('entities', 'hashtags')
sp_in_debate_tweets = sp_in_debate_tweets.withColumnRenamed('entities', 'hashtags')

Applying the pyspark.ml.feature.Word2Vec and pyspark.ml.clustering.KMeans

In [22]:
from pyspark.ml.feature import Word2Vec
word2Vec = Word2Vec(vectorSize=100, minCount=5, inputCol="text", outputCol="word_vector")
In [23]:
w2v_pre_model = word2Vec.fit(sp_pre_debate_tweets)
w2v_in_model = word2Vec.fit(sp_in_debate_tweets)
In [24]:
w2v_pre_result = w2v_pre_model.transform(sp_pre_debate_tweets)
w2v_pre_result.cache()
w2v_in_result = w2v_in_model.transform(sp_in_debate_tweets)
w2v_in_result.cache()
Out[24]:
DataFrame[hashtags: array<string>, lang: string, retweet_count: bigint, text: array<string>, timestamp_ms: bigint, tweet: string, word_vector: vector]
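
The `word_vector` column holds one vector per tweet. Spark's Word2Vec model builds it by averaging the vectors of the words in the document. The plain-Python sketch below illustrates that idea on toy data; the function name and vectors are hypothetical, and dividing by the total token count (including words missing from the vocabulary) is an assumption of this sketch about how out-of-vocabulary words are handled.

```python
def average_vector(tokens, word_vectors, size):
    """Average the vectors of the known words in a token list."""
    total = [0.0] * size
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            # Words below minCount have no vector and add nothing.
            continue
        for i, value in enumerate(vec):
            total[i] += value
    if not tokens:
        return total
    # Assumption: normalise by the full token count.
    return [value / len(tokens) for value in total]


# Toy 2-dimensional vocabulary, invented for illustration.
toy_vectors = {"vote": [1.0, 0.0], "win": [0.0, 1.0]}

doc_vector = average_vector(["vote", "win", "vote", "zzz"], toy_vectors, 2)
```

Because each tweet collapses to a single averaged vector, word order is lost, which may partly explain why short tweets with similar vocabulary but opposite sentiment land in the same cluster.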
In [25]:
from pyspark.ml.clustering import KMeans
kmeans = KMeans(featuresCol="word_vector", predictionCol="prediction", k=5, seed=1023)
kmeans_pre_model = kmeans.fit(w2v_pre_result)
kmeans_in_model = kmeans.fit(w2v_in_result)
In [26]:
trans_w2v_pre_results = kmeans_pre_model.transform(w2v_pre_result)
trans_w2v_in_results = kmeans_in_model.transform(w2v_in_result)
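
For readers unfamiliar with what `kmeans.fit` and `transform` do, here is a minimal pure-Python sketch of the basic k-means (Lloyd) iteration on toy 2-dimensional points. It is only an illustration: Spark's KMeans uses the k-means|| initialisation by default and runs distributed, whereas this sketch picks random starting points; all names and data here are hypothetical.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=20, seed=1023):
    """Plain Lloyd iteration with random initial centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Move the centroid to the mean of its members.
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assign(points, centroids)


# Two well-separated toy blobs, invented for illustration.
toy_points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]

toy_centroids, toy_labels = kmeans(toy_points, k=2)
```

The cluster indices themselves carry no meaning, which is why category 0 before the debate and category 0 during the debate describe different groups of tweets.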

Window Aggregation On Prediction

In [27]:
trans_w2v_pre_results = trans_w2v_pre_results.withColumn('time', 
                                 from_unixtime(trans_w2v_pre_results['timestamp_ms']/1000)).sort('time')
trans_w2v_in_results = trans_w2v_in_results.withColumn('time', 
                                 from_unixtime(trans_w2v_in_results['timestamp_ms']/1000)).sort('time')
In [28]:
trans_w2v_pre_results = trans_w2v_pre_results.withColumn('grid', 
                                                         window(trans_w2v_pre_results['time'], "20 minute"))
trans_w2v_in_results = trans_w2v_in_results.withColumn('grid', 
                                                         window(trans_w2v_in_results['time'], "30 minute"))
In [29]:
# createOrReplaceTempView registers the view and returns None, so there is nothing to assign
trans_w2v_pre_results.createOrReplaceTempView("pre_result")

trans_w2v_in_results.createOrReplaceTempView("in_result")
In [30]:
pre_result_window = spark.sql("""
  SELECT prediction, COUNT(prediction) AS _count, grid.start
  FROM pre_result
  GROUP BY prediction, grid.start
  ORDER BY prediction, grid.start
""")

pre_result_pdf = pre_result_window.toPandas()

in_result_window = spark.sql("""
  SELECT prediction, count(prediction) AS _count, grid.start
  FROM in_result
  GROUP BY prediction, grid.start
  ORDER BY prediction, grid.start
""")

in_result_pdf = in_result_window.toPandas()
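
The windowing and counting above can be mimicked on a small scale in plain Python, which may help clarify what `window(...)` plus `GROUP BY prediction, grid.start` computes. This is a sketch under the assumption that windows are fixed-width and aligned to the Unix epoch; the function names and toy rows are hypothetical.

```python
from collections import Counter


def window_start(timestamp_ms, minutes):
    """Start (epoch seconds) of the fixed window holding a timestamp."""
    width = minutes * 60
    seconds = timestamp_ms // 1000
    return seconds - seconds % width


def count_by_window(rows, minutes):
    """Count rows per (prediction, window start) pair."""
    counts = Counter()
    for timestamp_ms, prediction in rows:
        counts[(prediction, window_start(timestamp_ms, minutes))] += 1
    return counts


def shares_by_window(counts):
    """Turn counts into each category's share of its window."""
    totals = Counter()
    for (_, start), n in counts.items():
        totals[start] += n
    return {key: n / totals[key[1]] for key, n in counts.items()}


# (timestamp_ms, prediction) pairs; toy data for illustration.
toy_rows = [
    (1476919200000, 0),
    (1476919500000, 0),
    (1476919800000, 1),
    (1476920400000, 1),
]

toy_counts = count_by_window(toy_rows, minutes=20)
toy_shares = shares_by_window(toy_counts)
```

Normalising by the window total, as `shares_by_window` does, makes windows with very different tweet volumes comparable in a stacked bar chart.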
In [31]:
pre_time = list(pre_result_pdf['start'].unique())
in_time = list(in_result_pdf['start'].unique())
pre_count = []
in_count = []
for itime in pre_time:
    pre_count.append(pre_result_pdf[pre_result_pdf['start']==itime]['_count'].sum())
for itime in in_time:
    in_count.append(in_result_pdf[in_result_pdf['start']==itime]['_count'].sum())
In [32]:
# Share of each category within its time window: divide every count by the
# total number of tweets that fall in the same window.
pre_result_pdf['percentage'] = pre_result_pdf['_count'].astype(float) / \
    pre_result_pdf.groupby('start')['_count'].transform('sum')
in_result_pdf['percentage'] = in_result_pdf['_count'].astype(float) / \
    in_result_pdf.groupby('start')['_count'].transform('sum')
In [33]:
import cufflinks as cf
print cf.__version__
import plotly.plotly as py
0.8.2
In [34]:
pre_result_pdf_pivot = pre_result_pdf.pivot('start', 'prediction')['percentage']
in_result_pdf_pivot = in_result_pdf.pivot('start', 'prediction')['percentage']
In [35]:
pre_result_pdf_pivot.index
Out[35]:
DatetimeIndex(['2016-10-19 19:20:00', '2016-10-19 19:40:00',
               '2016-10-19 20:00:00', '2016-10-19 20:20:00',
               '2016-10-19 20:40:00', '2016-10-19 21:00:00'],
              dtype='datetime64[ns]', name=u'start', freq=None)
In [36]:
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')
pre_result_pdf_pivot.iplot(kind='barh',barmode='stack', bargap=.3, filename='cf_pre_debate')
In [37]:
cf.set_config_file(offline=True, world_readable=True, theme='ggplot')
in_result_pdf_pivot.iplot(kind='barh',barmode='stack', bargap=.3, filename='cf_in_debate')

What Is In Each Category?

In [38]:
for category in range(5):
    print "# ===========> Category {0} <=========== #".format(category)
    trans_w2v_pre_results.filter(trans_w2v_pre_results['prediction']==category).select('tweet').show(10, False)
# ===========> Category 0 <=========== #
+---------------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------+
|Mexico regulators ordered bank stress tests to model Trump victory: sources https://t.co/4TffHlncgT brandnaware https://t.co/gDFItyYKvf|
|Mexico regulators ordered bank stress tests to model Trump victory -  https://t.co/j2pLdrTNuy https://t.co/e7qP3zG6eU                  |
|Mexico regulators ordered bank stress tests to model Trump victory: sources https://t.co/zWCjsDZfJr                                    |
|Mexico regulators ordered bank stress tests to model Trump victory: sources: Mexico's financial… https://t.co/kyhLouHnOH #election     |
|Mexico regulators ordered bank stress tests to model Trump victory: sources https://t.co/T9ZCCIUiP8                                    |
|Mexico regulators ordered bank stress tests to model Trump victory: sources https://t.co/219yI3hthK ^Re                                |
|MEXICO ORDERED BANK STRESS TESTS MODELING TRUMP VICTORY: RTRS                                                                          |
|Mexico regulators ordered bank stress tests to model Trump victory: sources https://t.co/CAb7y5AxQE                                    |
|zerohedge: MEXICO ORDERED BANK STRESS TESTS MODELING TRUMP VICTORY: RTRS                                                               |
|#news #Mexico regulators ordered bank stress tests to model Trump victory: sources #business #fdlx                                     |
+---------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

# ===========> Category 1 <=========== #
+------------------------------------------------------------------------------------------------------+
|tweet                                                                                                 |
+------------------------------------------------------------------------------------------------------+
|Turn on the #debate #BeRomanticIn4Words                                                               |
|Let's get it on! #debate                                                                              |
|Let's get it on!
#debatenight #DebateHeadache https://t.co/dptp4m503n                                 |
|#Debate ~ Let's turn up the heat on Hillary folks ~ https://t.co/e7COtWPO74 ~~ https://t.co/58YIxt6paW|
|What will you be drinking during the #Debates ?                                                       |
|@FoxNews @mike_pence https://t.co/asAUwVkn2m PLEASE VOTE FOLKS!!!                                     |
|@Ash4President @AshvsEvilDead PLEASE VOTE FOLKS!!! https://t.co/asAUwVkn2m                            |
|Turn off the #debate #BeRomanticIn4Words                                                              |
|#Debate ~ Let's turn up the heat on Hillary folks ~ https://t.co/e7COtWPO74 ~~ https://t.co/okf1FP0GJk|
|#Debate ~ Let's turn up the heat on Hillary folks ~ https://t.co/D5suH2uloH https://t.co/cXlqiy9tV6   |
+------------------------------------------------------------------------------------------------------+
only showing top 10 rows

# ===========> Category 2 <=========== #
+--------------------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|'Bowl of Skittles' photographer sues Trumps, Pence     - CNET: Lawsuit alleges Trump campaign did not have pe... https://t.co/Z6CuAFSfcq    |
|⚡️ “Bernie's Best Trump Takedowns” by @BernieSanders

https://t.co/wjDaDMBU45                                                               |
|Love this! Donald Trump sings The Muppets' Mahna Mahna ... https://t.co/qjYBO17Hc9                                                          |
|Hillary could have sewn up election if she invited Alec Baldwin dressed as #Trump to sit in front row #debatenight                          |
|Amy Schumer booed by Donald Trump supporters https://t.co/jWnobvJlg9                                                                        |
|"My account was hacked." https://t.co/EzHN3Ud1VT                                                                                            |
|'Bowl of Skittles' photographer sues Trumps, Pence     - CNET: Lawsuit alleges Trump campaign did not have pe... https://t.co/vVGaXAmExG    |
|"Bookings at Trump's hotels were down 59 percent during the first half of 2016, according to the travel site Hipmun… https://t.co/GnzQib57FL|
|'Bowl of Skittles' photographer sues Trumps, Pence     - CNET: Lawsuit alleges Trump campaign did not have pe... https://t.co/TgI81k7shq    |
|Drain the Swamp!! Trump 2016                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

# ===========> Category 3 <=========== #
+------------------------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                                           |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|U.S. Presidential candidates, @HillaryClinton &amp;  @realDonaldTrump in final face-off tonight.Follow the #debate live https://t.co/lHByDNmC5X"|
|Another presidential debate come on tonight...ugh                                                                                               |
|Palin to be a Trump guest at tonight's debate https://t.co/GBflmHqCQ7 https://t.co/0IhxUb85GA                                                   |
|Palin to be a Trump guest at tonight's debate https://t.co/P3QQgVuDmM https://t.co/Xj4bv4j4Cv                                                   |
|Palin to be a Trump guest at tonight's debate https://t.co/Nnu3IQfsl7 https://t.co/7OIO6uobCV                                                   |
|{UAH} Unshackled Trump and confident Clinton to clash in final presidential debate: Donald… https://t.co/V7g2qLr00O                             |
|My plan for tonight's #debate: watch #Arrow and #Frequency instead.                                                                             |
|RT nytvideo: Fox News is in the spotlight with Chris Wallace moderating tonight's #debate https://t.co/aCHL8zXcEw                               |
|#ImWithHer and will be watching the final #debate LIVE on Twitter tonight! https://t.co/86mASjefW0 https://t.co/Y7uwwm6d8p                      |
|Trump, Clinton gear up for final face-off: Donald Trump heads into his final debate clash with Hillary Clinto... https://t.co/K7dbCGSMia        |
+------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

# ===========> Category 4 <=========== #
+--------------------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Fox Chapel doctor’s anti-Trump yard signs have a distinct Pittsburgh flavor https://t.co/mKTRJC9UIU                                         |
|This makes marginally less sense than the half-brother thing. https://t.co/Os8hYBUJcP                                                       |
|@RealDJTrumpTeam Trump strong. Go Trump                                                                                                     |
|I feel like Trump and Hillary are divorced parents fighting over custody of us, but we just wanna go with grandma.                          |
|@BretBaier @megynkelly @FoxNews   Trump hating crew there Bret.                                                                             |
|And Van says they are booing Trump...just then they break into "LOCK HER UP! LOCK HER UP" 😂 #Debate #DrainTheSwamp https://t.co/cp51TJgyH8 |
|@LouDobbs poor #cnn, been listening on radio pre #debatenight to see if they have changed? NOPE still same bies led… https://t.co/6GYtuADBcf|
|I HIGHLY DOUBT THAT.  He is another @FoxNews Trump hater. https://t.co/ABdQuljJ1N                                                           |
|REPPIN! https://t.co/ZI4sz0bMB9                                                                                                             |
|@thedailybeast Yes, anti-intellectual Donald Trump's 'go to' source for information and the publication he thinks deserves a Pulitzer.      |
+--------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

In [39]:
for category in range(5):
    print "# ===========> Category {0} <=========== #".format(category)
    trans_w2v_in_results.filter(trans_w2v_in_results['prediction']==category).select('tweet').show(10, False)
# ===========> Category 0 <=========== #
+-------------------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                                      |
+-------------------------------------------------------------------------------------------------------------------------------------------+
|Let's rock this mutha @realDonaldTrump #Debatenight                                                                                        |
|This is gonna be just great.... #debatenight                                                                                               |
|Oh gosh. Here we go with #debatenight not sure I'm ready for this.                                                                         |
|#cornmeal crusted #redsnapper and #fries. #fishandchips #presidentialdebate #worstelectionever… https://t.co/6xRxmXDBQM                    |
|God be with you Chris Wallace #PresidentialDebate                                                                                          |
|Eck. Did it have to start with the F-word? #fauxnews #debatenight                                                                          |
|Watching the president of Las Vegas Convention Center speak thinking he should run #debates                                                |
|WOOOOOO I FEEL LIKE IM FUCKING WATCHING A UFC FIGHT. FUCK YEAHHH #debatenight                                                              |
|End them all. https://t.co/JxHKv3OfLj                                                                                                      |
|BBCWorld: RT awzurcher: A look at the media bleachers just minutes before the #PresidentialDebate kicks off. I'll … https://t.co/Hzao8WMI3r|
+-------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

# ===========> Category 1 <=========== #
+----------------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------------------+
|Who do I vote for? #debatenight                                                                                                         |
|I just told @nbcwashington who I think is winning. Cast your vote now! #Debates #NBC4DC https://t.co/Ap20jEuq62                         |
|@Couples_Coop what do you think? https://t.co/vAm0heQW94                                                                                |
|@cnni @HillaryClinton wins! @CNNLIVE_ #debate https://t.co/vdNZrQtZ0S                                                                   |
|Who you voting for? #debatenight                                                                                                        |
|@jk_rowling What do you think about the #debatenight                                                                                    |
|He has my vote. https://t.co/BopShVNyaJ                                                                                                 |
|I just told @nbcwashington who I think is winning. Cast your vote now! #Debates #NBC4DC https://t.co/EwPfcQIMr3                         |
|#Debate...@HillaryClinton is winning! #Trump #FeelTheBern @HillaryClinton @GovGaryJohnson @DrJillStein #DumpTrump #NeverTrump #HRC #Bern|
|I just told @nbcwashington who I think is winning. Cast your vote now! #Debates #NBC4DC https://t.co/vRGzFtlx04                         |
+----------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

# ===========> Category 2 <=========== #
+---------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                            |
+---------------------------------------------------------------------------------------------------------------------------------+
|hell begun #debatenight                                                                                                          |
|@OuijaKnowsAll will trump win the presidential election in 20 days?                                                              |
|RNC: HILLARY IS LOSING.
#DebateNight                                                                                             |
|What the Hell is she talking about?  #Debate                                                                                     |
|SCOTUS is very important this election. #debatenight                                                                             |
|.@SenJohnMcCain said no #SupremeCourt nominee would be confirmed if @HillaryClinton is elected. That's not American! #debatenight|
|I don't know whether to use the #debatenight tag or #shitshow. Or #sniffleupagus.                                                |
|Don't know whether I should be watching the #Debate or the #NLCS                                                                 |
|Americans will have no rights if #HillaryClinton is elected POTUS #debatenight #debate                                           |
|What the hell is Trump doing? #debatenight                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

# ===========> Category 3 <=========== #
+--------------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------+
|Reuters US: FOREX-Dollar steady as investors await U.S. presidential debate, ECB https://t.co/rlEsBqOGQf                              |
|Watching the Usa Presidential debate.... not from the usa and i am stressed out!                                                      |
|It figures the last Presidential debate would be in Sin City. #prophetic #debatenight                                                 |
|I'm on the #TrumpTrain and will be watching the final #debate LIVE on Twitter tonight! https://t.co/56PQyZ52lc https://t.co/8QUAz1RPjN|
|Awake to watch the final. I expect fireworks #debatenight                                                                             |
|@In2TheSunshine2 And the Presidential Debate is on. #debate                                                                           |
|WATCH: The Third 2016 Presidential Debate (LIVE STREAM) https://t.co/NVbOGwixem https://t.co/eXVM4oGkLe                               |
|Me watching the #debates https://t.co/dCXFmnb67k                                                                                      |
|WATCH: The Third 2016 Presidential Debate (LIVE STREAM) https://t.co/tSZXHIGdPg https://t.co/r5pnmxmVaF                               |
|watching the presidential debate 2016                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

# ===========> Category 4 <=========== #
+-----------------------------------------------------------------------------------------------------------------------------------------+
|tweet                                                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------------------+
|DONALD TRUMP FC #debatenight                                                                                                             |
|Fact checkers are about to go through hell #debatenight                                                                                  |
|Go Go, Trump!!!! The people are with you! #debatenight                                                                                   |
|J is for Jackass #debatenight #DonaldJTrump                                                                                              |
|@peterdaou Trump was grabbed and Hillary was yelled at! Not same circumstance. Besides Hillary looked like she pushed herself            |
|@FaZe_Censor low key want trump to win.:.                                                                                                |
|There goes Megyn again...talking about Trump's issues! Sigh...#Debate #FoxNews                                                           |
|When the brother of the President of The Untied States of America is voting for Donald Trump, that should tell you something!! #Trump2016|
|Legends only https://t.co/WIujQnUEnm                                                                                                     |
|@katiepack @newtgingrich lil Trump                                                                                                       |
+-----------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

Top Ten Words

In [40]:
# Collect the per-tweet token lists from the 'text' column, then flatten them into one list of words.
pre_debate_words = pre_debate_tweets_sample['text'].tolist()
pre_debate_words_flat = [val for sublist in pre_debate_words for val in sublist]
In [41]:
len(pre_debate_words_flat)
Out[41]:
663585
In [42]:
# Collect the per-tweet token lists from the 'text' column, then flatten them into one list of words.
in_debate_words = in_debate_tweets_sample['text'].tolist()
in_debate_words_flat = [val for sublist in in_debate_words for val in sublist]
In [43]:
len(in_debate_words_flat)
Out[43]:
1526647
In [44]:
pre_debate_words_df = spark.createDataFrame(pd.DataFrame(pre_debate_words_flat, columns=['word']))
# createOrReplaceTempView returns None, so its result is not assigned; the view is referenced by name in SQL.
pre_debate_words_df.createOrReplaceTempView("pre_debate_words_window")
In [45]:
spark.sql("""
  SELECT word, COUNT(word) AS count
  FROM pre_debate_words_window
  GROUP BY word
  ORDER BY count DESC
""").show(10)
+----------+-----+
|      word|count|
+----------+-----+
|     trump|44167|
|     debat|14081|
|presidenti| 9209|
|   tonight| 7812|
|    donald| 7652|
|     watch| 7429|
|     final| 7015|
|   clinton| 6738|
|   hillari| 6452|
|      live| 5518|
+----------+-----+
only showing top 10 rows

In [46]:
in_debate_words_df = spark.createDataFrame(pd.DataFrame(in_debate_words_flat, columns=['word']))
# createOrReplaceTempView returns None, so its result is not assigned; the view is referenced by name in SQL.
in_debate_words_df.createOrReplaceTempView("in_debate_words_window")
In [47]:
spark.sql("""
  SELECT word, COUNT(word) AS count
  FROM in_debate_words_window
  GROUP BY word
  ORDER BY count DESC
""").show(10)
+----------+-----+
|      word|count|
+----------+-----+
|     trump|94843|
|     debat|22079|
|   hillari|21397|
|    donald|18828|
|   clinton|16445|
|       say|13824|
|presidenti|11677|
|      like|11443|
|      vote|10905|
|     elect|10347|
+----------+-----+
only showing top 10 rows
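Because the in-debate sample holds more than twice as many tokens as the pre-debate sample (1,526,647 versus 663,585), the raw counts in the two tables are not directly comparable. One way to compare them, sketched here as an assumption and not as part of the original analysis, is to divide each count by the size of its window. The names below (`PRE_TOTAL`, `IN_TOTAL`, `share`) are hypothetical, and the counts are copied from the two outputs above.

```python
# Total token counts reported by len(...) for each window.
PRE_TOTAL = 663585
IN_TOTAL = 1526647

# Counts copied from the two top-ten tables (only words present in both).
pre_counts = {"trump": 44167, "debat": 14081, "clinton": 6738, "hillari": 6452}
in_counts = {"trump": 94843, "debat": 22079, "clinton": 16445, "hillari": 21397}


def share(count, total):
    """Return the fraction of all tokens in a window taken up by one word."""
    return float(count) / total


# Print each word's share of its window so the two windows can be compared.
for word in sorted(pre_counts):
    pre_share = share(pre_counts[word], PRE_TOTAL)
    in_share = share(in_counts[word], IN_TOTAL)
    print("{0:<8} pre={1:.2%}  in={2:.2%}".format(word, pre_share, in_share))
```

Normalizing this way separates a genuine shift in attention from the simple growth in tweet volume once the debate started.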

Top 30 #HashTags

In [48]:
# Collect the per-tweet lists from the 'entities' column, then flatten them into one list of hashtags.
pre_debate_tags = pre_debate_tweets_sample['entities'].tolist()
pre_debate_tags_flat = [val for sublist in pre_debate_tags for val in sublist]
In [49]:
# Collect the per-tweet lists from the 'entities' column, then flatten them into one list of hashtags.
in_debate_tags = in_debate_tweets_sample['entities'].tolist()
in_debate_tags_flat = [val for sublist in in_debate_tags for val in sublist]
In [50]:
# Register both hashtag DataFrames as temp views (createOrReplaceTempView returns None, so nothing is assigned).
pre_debate_tags_df = spark.createDataFrame(pd.DataFrame(pre_debate_tags_flat, columns=['tags']))
pre_debate_tags_df.createOrReplaceTempView("pre_debate_tags_window")
in_debate_tags_df = spark.createDataFrame(pd.DataFrame(in_debate_tags_flat, columns=['tags']))
in_debate_tags_df.createOrReplaceTempView("in_debate_tags_window")
In [51]:
spark.sql("""
  SELECT tags, COUNT(tags) AS tag_count
  FROM pre_debate_tags_window
  GROUP BY tags
  ORDER BY tag_count DESC
""").show(30)
+--------------------+---------+
|                tags|tag_count|
+--------------------+---------+
|              debate|    14929|
|         debatenight|     7337|
|              Debate|     2771|
|               Trump|     2594|
|  PresidentialDebate|     1505|
|           ImWithHer|     1323|
|             debates|     1194|
|          TrumpTrain|     1128|
|             Hillary|      967|
|         DebateNight|      795|
|             Clinton|      617|
|                MAGA|      613|
|               trump|      595|
|       DrainTheSwamp|      498|
|      DebateHeadache|      481|
|          debate2016|      470|
|             Debates|      368|
|        Election2016|      350|
|         Debatenight|      306|
|          NeverTrump|      267|
|                 CNN|      235|
|         Debates2016|      234|
|MakeAmericaGreatA...|      201|
|                news|      197|
|        TrumpPence16|      196|
|      HillaryClinton|      193|
|         DonaldTrump|      180|
|RejectedHillarySl...|      175|
|                tcot|      172|
|                News|      141|
+--------------------+---------+
only showing top 30 rows

In [52]:
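# Trump: Trump, TrumpTrain, MAGA, trump, DrainTheSwamp, MakeAmericaGreatA..., TrumpPence16, DonaldTrump, RejectedHillarySl...
# Clinton: ImWithHer, Hillary, Clinton, NeverTrump, HillaryClinton
print("pre-debate:\nTrump\t:\t{0}\nClinton\t:\t{1}".format(2594 + 1128 + 613 + 595 + 498 + 201 + 196 + 180 + 175,
                                                           1323 + 967 + 617 + 267 + 193))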
pre-debate:
Trump	:	6180
Clinton	:	3367
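The hard-coded sums in the cell above can also be derived from an explicit tag-to-candidate mapping, which makes the grouping visible and reusable for other windows. The sketch below is illustrative only. The names `PRE_DEBATE_COUNTS`, `TAG_CAMP`, and `tally_by_camp` are hypothetical, the counts are copied from the pre-debate table, and the two tags that Spark truncated in that output are kept in their truncated form. The grouping is an assumption inferred from which numbers appear in each sum.

```python
from collections import defaultdict

# Counts copied from the pre-debate hashtag table (candidate-related rows only).
# Two tag names are truncated in the Spark output and are kept exactly as shown.
PRE_DEBATE_COUNTS = {
    "Trump": 2594, "TrumpTrain": 1128, "MAGA": 613, "trump": 595,
    "DrainTheSwamp": 498, "MakeAmericaGreatA...": 201, "TrumpPence16": 196,
    "DonaldTrump": 180, "RejectedHillarySl...": 175,
    "ImWithHer": 1323, "Hillary": 967, "Clinton": 617,
    "NeverTrump": 267, "HillaryClinton": 193,
}

# Assumed grouping, chosen to match the hard-coded sums in the original cell.
TAG_CAMP = {
    "Trump": "Trump", "TrumpTrain": "Trump", "MAGA": "Trump", "trump": "Trump",
    "DrainTheSwamp": "Trump", "MakeAmericaGreatA...": "Trump",
    "TrumpPence16": "Trump", "DonaldTrump": "Trump",
    "RejectedHillarySl...": "Trump",
    "ImWithHer": "Clinton", "Hillary": "Clinton", "Clinton": "Clinton",
    "NeverTrump": "Clinton", "HillaryClinton": "Clinton",
}


def tally_by_camp(counts, tag_camp):
    """Sum hashtag counts per camp; tags missing from the mapping are ignored."""
    totals = defaultdict(int)
    for tag, n in counts.items():
        camp = tag_camp.get(tag)
        if camp is not None:
            totals[camp] += n
    return dict(totals)


pre_debate_tally = tally_by_camp(PRE_DEBATE_COUNTS, TAG_CAMP)
print(pre_debate_tally)
```

Note that tags such as `Trump` or `Hillary` only mention a candidate and do not necessarily express support, so these totals measure attention more than preference.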
In [53]:
spark.sql("""
  SELECT tags, COUNT(tags) AS tag_count
  FROM in_debate_tags_window
  GROUP BY tags
  ORDER BY tag_count DESC
""").show(30)
+------------------+---------+
|              tags|tag_count|
+------------------+---------+
|       debatenight|    57764|
|            debate|    26002|
|       DebateNight|     6000|
|            Debate|     4744|
|             Trump|     4393|
|           Debates|     4263|
|           debates|     3523|
|            NBC4DC|     3446|
|       Debatenight|     2196|
|         ImWithHer|     2062|
|PresidentialDebate|     1922|
|           Hillary|     1708|
|        debate2016|     1239|
|             trump|     1154|
|              MAGA|     1048|
|    DebateHeadache|     1025|
|           Clinton|      801|
|       Debates2016|      761|
|         FactCheck|      715|
|     TrumpDebateCT|      703|
|             NBCCT|      694|
|    HillaryClinton|      659|
|       DonaldTrump|      620|
|      Election2016|      571|
|      ChrisWallace|      505|
|        NeverTrump|      502|
|     DrainTheSwamp|      472|
|         imwithher|      456|
|        TrumpTrain|      433|
|              news|      417|
+------------------+---------+
only showing top 30 rows
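Both hashtag tables count tags case-sensitively, so `debatenight`, `DebateNight`, and `Debatenight` appear as separate rows. A minimal sketch of folding case variants together is shown below. `merge_case_variants` is a hypothetical helper, and the input contains only the case-variant rows visible in the in-debate top-30 output above, so the merged totals ignore any variants ranked below row 30. In Spark SQL, a similar result could be obtained by grouping on `LOWER(tags)`.

```python
from collections import Counter

# Rows copied from the in-debate table that have at least one case variant.
in_debate_rows = [
    ("debatenight", 57764), ("debate", 26002), ("DebateNight", 6000),
    ("Debate", 4744), ("Trump", 4393), ("Debates", 4263), ("debates", 3523),
    ("Debatenight", 2196), ("ImWithHer", 2062), ("trump", 1154),
    ("imwithher", 456),
]


def merge_case_variants(rows):
    """Sum counts of hashtags that differ only in letter case."""
    merged = Counter()
    for tag, count in rows:
        merged[tag.lower()] += count
    return merged


merged = merge_case_variants(in_debate_rows)
# Print the merged tags from most to least frequent.
for tag, count in merged.most_common():
    print("{0}\t{1}".format(tag, count))
```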

More Visualizations Using ggplot for Offline Viewing

In [54]:
ggplot.ggplot(pre_result_pdf, ggplot.aes(x='start', weight='percentage', fill='factor(prediction)')) \
+ ggplot.geom_bar()
Out[54]:
<ggplot: (8788423083181)>
In [55]:
ggplot.ggplot(in_result_pdf, ggplot.aes(x='start', weight='percentage', fill='factor(prediction)')) \
+ ggplot.geom_bar()
Out[55]:
<ggplot: (8788423083085)>